
Limitations

Neural Information Processing Systems

While our study identifies clear separations between model hypothesis classes, our best models still have not reached the consistency ceiling of the neural and behavioral benchmarks we compare against. All models were trained simultaneously across all eight scenarios of the Physion Dynamics Training Set, around 16,000 training scenarios in total (2,000 scenes per scenario) [Bear et al., 2021]; each C-SWM [Kipf et al., 2020] model was likewise trained on this set. For each stimulus, we compute the proportion of "hit" responses. The Correlation to Average Human Response is the Pearson's correlation between the model probability-hit vector and the human proportion-hit vector, computed across the stimuli of each scenario. OCP Accuracy of humans and models is the average accuracy across the stimuli of each scenario. To obtain final values for the two quantities, we then compute the weighted mean and s.e.m. of the per-scenario values. Note that these values therefore differ per condition, but are always the same across all models. All neural predictivities are reported on held-out conditions and their timepoints.
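The two evaluation quantities described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the function names and the use of stimulus counts as weights are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def correlation_to_avg_human(model_prob_hit, human_prop_hit):
    """Pearson's correlation between the per-stimulus model
    probability-hit vector and the human proportion-hit vector."""
    r, _ = pearsonr(model_prob_hit, human_prop_hit)
    return r

def weighted_mean_sem(per_scenario_values, weights):
    """Weighted mean and s.e.m. of per-scenario values
    (weights assumed to be, e.g., stimulus counts per scenario)."""
    v = np.asarray(per_scenario_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    mean = np.average(v, weights=w)
    var = np.average((v - mean) ** 2, weights=w)  # weighted variance
    sem = np.sqrt(var / len(v))                   # s.e.m. over scenarios
    return mean, sem
```

A model whose hit probabilities vary linearly with the human hit proportions would score a correlation of 1.0 under this metric.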


Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

Fircă, Liviu Nicolae, Bărbălau, Antonio, Oneata, Dan, Burceanu, Elena

arXiv.org Artificial Intelligence

Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation of the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute "has four legs" is common to both "dogs" and "chairs". To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
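The embedding-based clustering strategy can be sketched as follows: cluster category embeddings, then assign whole clusters to either train or test, so that test categories are dissimilar from anything seen during training. This is a minimal illustration under assumed names, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_split(embeddings, categories, n_clusters=5, test_clusters=1, seed=0):
    """Split categories by assigning whole embedding clusters to
    train or test, reducing hidden train-test correlations.
    embeddings: one vector per category; categories: category names."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(np.asarray(embeddings, dtype=float))
    rng = np.random.default_rng(seed)
    held_out = {int(c) for c in rng.choice(n_clusters, size=test_clusters, replace=False)}
    train = [c for c, l in zip(categories, labels) if int(l) not in held_out]
    test = [c for c, l in zip(categories, labels) if int(l) in held_out]
    return train, test
```

Because entire clusters are held out, no test category has a near-duplicate in the training set, unlike a random per-category split.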


Supplementary: Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning, Appendix A: Analyzing the model bias for selecting train-test splits

Neural Information Processing Systems

These settings are used throughout our study. In Tab. 1 we show the measured FID scores between each train and test split; for each dataset we show examples of an easy, medium and hard train-test split. Tab. 2 first illustrates the FID scores for all pairwise combinations. However, the fact that FID scores are relatively close to one another despite large semantic differences between datasets may indicate limitations of our utilised FID estimator (Sec.). This section also provides additional results for the experiments presented in Sec. 4 of the main paper; to this end, we provide the exact performance values used to visualize Figure 1 of the main paper in Tab.
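For reference, the FID between two splits is the Fréchet distance between Gaussians fitted to their feature activations: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^{1/2}). A minimal sketch of that formula (the estimator choice discussed above, e.g. which feature network to use, is not reproduced here):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussians (mean, covariance)
    fitted to the feature activations of two data splits."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical splits give an FID of zero; larger values indicate a harder train-test shift under this measure.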



Interpretable Generalized Additive Models for Datasets with Missing Values

Neural Information Processing Systems

Many important datasets contain samples that are missing one or more feature values. Maintaining the interpretability of machine learning models in the presence of such missing data is challenging. Singly or multiply imputing missing values complicates the model's mapping from features to labels.



Zero-Shot Performance Prediction for Probabilistic Scaling Laws

Schram, Viktoria, Hiller, Markus, Beck, Daniel, Cohn, Trevor

arXiv.org Artificial Intelligence

The prediction of learning curves for Natural Language Processing (NLP) models enables informed decision-making to meet specific performance objectives, while reducing computational overhead and lowering the costs associated with dataset acquisition and curation. In this work, we formulate the prediction task as a multitask learning problem, where each task's data is modelled as being organized within a two-layer hierarchy. To model the shared information and dependencies across tasks and hierarchical levels, we employ latent variable multi-output Gaussian Processes, which allow us to account for task correlations and support zero-shot prediction of learning curves (LCs). We demonstrate that this approach facilitates the development of probabilistic scaling laws at lower costs. Applying an active learning strategy, LCs can be queried to reduce predictive uncertainty and provide predictions close to ground truth scaling laws. We validate our framework on three small-scale NLP datasets with up to $30$ LCs. These are obtained from nanoGPT models, from bilingual translation using mBART and Transformer models, and from multilingual translation using M2M100 models of varying sizes.
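The core idea, a probabilistic fit to a learning curve whose predictive uncertainty can drive active querying, can be sketched with a single-task GP. This is a simplified stand-in for the paper's latent variable multi-output GP; the data points are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical learning-curve observations: (training-set size, score).
sizes = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
scores = np.array([0.52, 0.61, 0.70, 0.76, 0.80])

# Fit a GP on log-size; the predictive std at unqueried sizes is the
# uncertainty an active learning strategy would reduce by querying LCs.
X = np.log10(sizes).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(1e-4), normalize_y=True)
gp.fit(X, scores)
mean, std = gp.predict(np.array([[np.log10(3e5)]]), return_std=True)
```

A multi-output GP extends this by sharing a latent structure across tasks, which is what enables zero-shot prediction of a new task's curve before any of its points have been observed.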



